
Cosmos3 ModularPipeline #14110

Open
yzhautouskay wants to merge 2 commits into huggingface:main from yzhautouskay:yzhautouskay/cosmos3_modular_pipeline

Conversation

@yzhautouskay

Contributor

What does this PR do?

Summary

  • Add Cosmos3 modular pipeline support via Cosmos3OmniModularPipeline and Cosmos3OmniBlocks.
  • Implement modular Cosmos3 stages for encoding, pre-denoise setup, denoising loop, and decoding.
  • Register/export Cosmos3 modular pipeline components in modular and top-level package mappings.
  • Add Cosmos3 modular documentation and usage section in docs/source/en/api/pipelines/cosmos3.md.
  • Add strict elementwise parity tests across text/image/video, optional sound, and action-conditioned modes.

Test Plan

  • PYTHONPATH=src python -m pytest -q tests/pipelines/cosmos/test_cosmos3_modular_parity.py -vv

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

github-actions bot added the documentation, tests, modular-pipelines, and size/L labels on Jul 2, 2026

@yiyixuxu left a comment

Collaborator

Thanks for working on this!
I did an initial review, focusing mainly on the encoder/decoder blocks for now. In modular pipelines, these blocks are meant to be run either standalone (e.g. a user encodes an image once, keeps the latents, and reuses them across generations) or combined into a pipeline you can run end-to-end like a standard pipeline.

I will do another pass soon. Let me know if you have any questions!

@property
def expected_components(self) -> list[ComponentSpec]:
    return [
        ComponentSpec("transformer", Cosmos3OmniTransformer),

Collaborator

Suggested change
ComponentSpec("transformer", Cosmos3OmniTransformer),

Comment on lines +35 to +36
ComponentSpec("vae", AutoencoderKLWan),
ComponentSpec("sound_tokenizer", Cosmos3AVAEAudioTokenizer),

Collaborator

Suggested change
ComponentSpec("vae", AutoencoderKLWan),
ComponentSpec("sound_tokenizer", Cosmos3AVAEAudioTokenizer),


@property
def description(self) -> str:
    return "Validates inputs, tokenizes prompts, and packs text conditioning."

Collaborator

I think this step can just run the safety_checker and tokenize the prompt. We want the text encoder block to be meaningful to run standalone, as well as combined with other blocks.

That is, the user can run it once, keep the text segments, and reuse them across many generations with different resolutions, conditional inputs, seeds, etc.
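
As a rough illustration of the encode-once, reuse-many pattern described in this comment, here is a minimal plain-Python sketch. It deliberately does not use the diffusers API: `TextSegments`, `encode_text`, and `generate` are hypothetical stand-ins invented for this example, and the "tokenizer" is a toy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSegments:
    """Hypothetical container for the output of a standalone text step."""

    token_ids: tuple[int, ...]


def encode_text(prompt: str) -> TextSegments:
    # Toy tokenizer (assumption): one fake id per whitespace-separated word.
    return TextSegments(tuple(len(word) for word in prompt.split()))


def generate(text: TextSegments, *, height: int, width: int, seed: int) -> dict:
    # Stand-in for the rest of the pipeline. It only consumes the cached
    # segments and never needs the raw prompt again.
    return {"tokens": len(text.token_ids), "size": (height, width), "seed": seed}


# Encode once, then reuse the same segments across resolutions and seeds.
segments = encode_text("a robot arm stacking blocks")
runs = [
    generate(segments, height=h, width=w, seed=s)
    for (h, w, s) in [(720, 1280, 0), (480, 832, 1)]
]
```

The point of the sketch is only the data flow: because the text step depends on nothing but the prompt, its output can be cached and fed to any number of downstream runs.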

Comment on lines +44 to +47
InputParam(name="image", default=None),
InputParam(name="video", default=None),
InputParam(name="condition_frame_indexes_vision", default=(0, 1)),
InputParam(name="condition_video_keep", default="first"),

Collaborator

Suggested change
InputParam(name="image", default=None),
InputParam(name="video", default=None),
InputParam(name="condition_frame_indexes_vision", default=(0, 1)),
InputParam(name="condition_video_keep", default="first"),

Comment on lines +53 to +55
InputParam(name="guidance_scale", type_hint=float, default=6.0),
InputParam(name="enable_sound", type_hint=bool, default=False),
InputParam(name="action", type_hint=CosmosActionCondition, default=None),

Collaborator

Suggested change
InputParam(name="guidance_scale", type_hint=float, default=6.0),
InputParam(name="enable_sound", type_hint=bool, default=False),
InputParam(name="action", type_hint=CosmosActionCondition, default=None),

Comment on lines +78 to +79
if isinstance(block_state.callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
    block_state.callback_on_step_end_tensor_inputs = block_state.callback_on_step_end.tensor_inputs

Collaborator

Suggested change
if isinstance(block_state.callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
    block_state.callback_on_step_end_tensor_inputs = block_state.callback_on_step_end.tensor_inputs

We do not need to support pipeline callbacks in modular pipelines, since it is so easy to insert or swap blocks.
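
To make the "insert a block instead of a callback" idea concrete, here is a small plain-Python sketch. It models a pipeline as an ordered mapping of named steps and is not the diffusers implementation: `insert_after`, `run`, and the three toy steps are hypothetical names used only for illustration.

```python
from collections import OrderedDict


def denoise(state: dict) -> dict:
    # Toy denoising step: halve every latent value.
    state["latents"] = [x * 0.5 for x in state["latents"]]
    return state


def decode(state: dict) -> dict:
    # Toy decoding step: copy latents into the output.
    state["video"] = list(state["latents"])
    return state


def log_latents(state: dict) -> dict:
    # Extra step that does what a step-end callback might otherwise do.
    state.setdefault("log", []).append(list(state["latents"]))
    return state


def insert_after(blocks: OrderedDict, anchor: str, name: str, block) -> OrderedDict:
    """Return a new ordered mapping with `block` placed right after `anchor`."""
    if anchor not in blocks:
        raise KeyError(anchor)
    out = OrderedDict()
    for key, value in blocks.items():
        out[key] = value
        if key == anchor:
            out[name] = block
    return out


def run(blocks: OrderedDict, state: dict) -> dict:
    # Execute the steps in order, threading the shared state through.
    for block in blocks.values():
        state = block(state)
    return state


blocks = OrderedDict(denoise=denoise, decode=decode)
blocks = insert_after(blocks, "denoise", "log_latents", log_latents)
result = run(blocks, {"latents": [1.0, 2.0]})
```

In this toy model the observation logic lives in its own named step, so removing or replacing it is a change to the block list rather than to the denoising code.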

if block_state.width is None:
    block_state.width = 1280

components.check_inputs(

Collaborator

We only need to check the inputs used in this block (I think you cannot directly reuse the check_inputs method from the standard pipeline).
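
One way to read this suggestion is a validation helper scoped to a single block. The sketch below is an assumption about what that could look like for a text step; `check_text_inputs` and the `negative_prompt` argument are hypothetical and do not come from the PR.

```python
def check_text_inputs(prompt, negative_prompt=None) -> None:
    """Validate only the inputs a text step consumes (hypothetical helper)."""
    if not isinstance(prompt, (str, list)):
        raise ValueError(f"`prompt` must be a str or list of str, got {type(prompt)}")
    if isinstance(prompt, list) and not all(isinstance(p, str) for p in prompt):
        raise ValueError("every item in `prompt` must be a str")
    if negative_prompt is not None and type(negative_prompt) is not type(prompt):
        raise ValueError("`negative_prompt` must have the same type as `prompt`")


# Resolution, frame indexes, and similar inputs are intentionally not checked
# here: in this sketch they would be validated by the blocks that use them.
check_text_inputs("a robot arm stacking blocks")
check_text_inputs(["first prompt", "second prompt"], ["bad", "worse"])
```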

    condition_frame_indexes_vision=block_state.condition_frame_indexes_vision,
)

block_state.action_mode = block_state.action.mode if block_state.action is not None else None

Collaborator

Can we give action its own text block? A Cosmos3ActionTextStep would take prompt + action and then build the action JSON prompt, do the resolution binning, tokenize, and so on.

Then you can wrap this step (Cosmos3TextEncoderStep) and Cosmos3ActionTextStep into an AutoPipelineBlocks (e.g. Cosmos3AutoTextEncoderStep) triggered on action. This way each mode's text logic stays self-contained and more readable.

See an example here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/qwenimage/modular_blocks_qwenimage_edit.py#L200

That example is an auto VAE encoder step, but it should work similarly for a text step as well.
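
The dispatch idea behind an auto step triggered on action can be sketched in plain Python as follows. This is a toy model under stated assumptions, not the diffusers `AutoPipelineBlocks` class: the three classes, the dict-based state, and the shape of `action` (a dict with a `mode` key) are all hypothetical.

```python
import json


class TextStep:
    """Toy default text step: split the prompt into segments."""

    def __call__(self, state: dict) -> dict:
        state["text_segments"] = state["prompt"].split()
        return state


class ActionTextStep:
    """Toy action text step: build a JSON prompt from prompt + action."""

    def __call__(self, state: dict) -> dict:
        payload = {"prompt": state["prompt"], "mode": state["action"]["mode"]}
        state["text_segments"] = [json.dumps(payload)]
        return state


class AutoTextStep:
    """Pick one sub-step based on which trigger input is present."""

    # (trigger input name, step) pairs checked in order; None marks the default.
    branches = [("action", ActionTextStep()), (None, TextStep())]

    def __call__(self, state: dict) -> dict:
        for trigger, step in self.branches:
            if trigger is None or state.get(trigger) is not None:
                return step(state)
        return state


plain = AutoTextStep()({"prompt": "a red car", "action": None})
acted = AutoTextStep()({"prompt": "move", "action": {"mode": "forward"}})
```

Each branch keeps its own text logic, and the wrapper contains nothing except the routing rule, which is the readability benefit the comment is asking for.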



logger = logging.get_logger(__name__)

Collaborator

Can you separate the VAE encoding from prepare_latent and add a proper Cosmos3VaeEncoderStep here?

We probably need a Cosmos3VaeEncoderStep (for i2v and v2v) and a Cosmos3ActionVaeEncoderStep, packed into an auto-step triggered on image/video/action.

Similar to the text step, the VAE encoder step should also be able to run standalone when needed: a user should be able to run just the VAE encoder once, keep the latents, and reuse them across generations.

logger = logging.get_logger(__name__)


class Cosmos3DecodeStep(ModularPipelineBlocks):

Collaborator

I think we should split by modality as well, into Cosmos3VideoDecoderStep and Cosmos3SoundDecoderStep. The sound one can go into an auto block so it only runs if sound_latents is not None, like https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/z_image/modular_blocks_z_image.py#L231.

Similar to the encoder steps, the user should also be able to run a decoder step standalone, so each block should just decode latents and run the safety checker, nothing else.

The action-related code in the current block isn't decoding. I think it can probably go into its own block.
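
A minimal plain-Python sketch of the per-modality split might look like this. The decode functions are toy arithmetic, and all names are hypothetical stand-ins rather than the proposed Cosmos3 classes or any diffusers API; the safety checker is omitted for brevity.

```python
def decode_video(latents: list[float]) -> list[float]:
    # Toy video decoder (assumption): scale the latents.
    return [x * 2 for x in latents]


def decode_sound(sound_latents: list[float]) -> list[float]:
    # Toy sound decoder (assumption): shift the latents.
    return [x + 1 for x in sound_latents]


def video_decoder_step(state: dict) -> dict:
    # Always runs; only touches video latents.
    state["video"] = decode_video(state["latents"])
    return state


def auto_sound_decoder_step(state: dict) -> dict:
    # Runs the sound decoder only when its trigger input is present.
    if state.get("sound_latents") is not None:
        state["sound"] = decode_sound(state["sound_latents"])
    return state


def decode(state: dict) -> dict:
    # End-to-end decode: video first, then the optional sound branch.
    return auto_sound_decoder_step(video_decoder_step(state))


silent = decode({"latents": [1.0, 2.0]})
with_sound = decode({"latents": [1.0], "sound_latents": [0.5]})
```

Because each step reads only its own latents, either one can also be called on its own with precomputed latents, which matches the standalone-use goal stated above.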


2 participants